4 Why Self-Attention

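"Various aspects of self-attention layers are compared with the recurrent and convolutional layers widely used to map one variable-length sequence of symbol representations x to another sequence z of the same length (such as a hidden layer in a typical sequence transduction encoder or decoder)."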

Motivating our use of self-attention we consider three desiderata.

"Three desired properties."

One is the total computational complexity per layer.

"The total computational cost of each layer."

Table 1 (r is covered separately below)

r the size of the neighborhood in restricted self-attention.
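As a rough aid for reading Table 1, the sketch below evaluates the per-layer complexity orders that the paper's Table 1 lists (these notes do not reproduce the table itself). The function name `per_layer_ops` and the default values of k and r are assumptions made for illustration, and constant factors are dropped, so only the relative sizes are meaningful.

```python
def per_layer_ops(n, d, k=3, r=64):
    """Leading-order operation counts per layer (constants dropped).

    n: sequence length, d: representation dimension,
    k: convolution kernel width, r: neighborhood size.
    The defaults for k and r are illustrative assumptions.
    """
    return {
        "self_attention": n * n * d,             # O(n^2 * d)
        "recurrent": n * d * d,                  # O(n * d^2)
        "convolutional": k * n * d * d,          # O(k * n * d^2)
        "restricted_self_attention": r * n * d,  # O(r * n * d)
    }


ops = per_layer_ops(n=50, d=512)
for layer_type, count in ops.items():
    print(f"{layer_type:28s} {count:>12,d}")
```

With n smaller than d, as in this example, the self-attention count comes out below the recurrent one; the ordering flips once n exceeds d.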

Another is the amount of computation that can be parallelized.

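"How much of the computation can run in parallel."

Also described as "measured by the minimum number of sequential operations required".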

The third is the path length between long-range dependencies in the network.

"The length of the path between long-range dependencies in the network."

One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network.

"A key factor in the ability to learn such dependencies is how long a path the forward- and backward-propagating signals must travel through the network."

The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.

"The shorter these paths between any combination of positions in the input and output sequences, the easier long-range dependencies are (said to be) to learn." (Reference 12)
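To make the path-length criterion concrete, the sketch below evaluates the maximum path length orders given in the paper's Table 1: O(1) for self-attention, O(n) for recurrent layers, O(log_k(n)) for convolutions (the dilated case) and O(n/r) for restricted self-attention. The helper names and the default k and r are assumptions for illustration, and the values are order-of-magnitude estimates rather than exact layer counts.

```python
def ceil_log(n, k):
    """Smallest integer L with k**L >= n, computed without floating point."""
    layers, reach = 0, 1
    while reach < n:
        reach *= k
        layers += 1
    return layers


def max_path_length(n, k=3, r=64):
    """Order-of-magnitude maximum path length between any two positions.

    k (kernel width) and r (neighborhood size) are illustrative defaults.
    """
    return {
        "self_attention": 1,                        # O(1)
        "recurrent": n,                             # O(n)
        "convolutional_dilated": ceil_log(n, k),    # O(log_k(n))
        "restricted_self_attention": -(-n // r),    # O(n / r), rounded up
    }


print(max_path_length(1000))
```

For n = 1000 the recurrent path grows with the sequence length, while the self-attention path stays constant.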

Regarding the first point: r in Table 1

To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position.

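"To improve computational performance on tasks with very long sequences, self-attention can be restricted to a neighborhood of size r in the input sequence, centered on each output position."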
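One way to picture this restriction is as a boolean mask over (output position, input position) pairs. The sketch below is a minimal illustration, not the paper's implementation: the function name `neighborhood_mask` is hypothetical, input and output are assumed to have the same length n, and the window is assumed to reach r // 2 positions to each side, so it covers exactly r positions when r is odd.

```python
def neighborhood_mask(n, r):
    """mask[i][j] is True when input position j lies in the size-r
    neighborhood centered on output position i.

    Assumption: the window extends r // 2 positions to each side and is
    simply clipped at the sequence boundaries.
    """
    half = r // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


mask = neighborhood_mask(n=6, r=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Each row allows at most r positions, so the number of attended pairs
# is bounded by r * n instead of n * n.
allowed_pairs = sum(sum(row) for row in mask)
```

Positions outside the mask would be excluded before the softmax, which is why the per-layer cost scales with r * n rather than n * n.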

As side benefit, self-attention could yield more interpretable models.

Refers the reader to the Appendix (results for the trained heads)